1 MVP

You are given a set of data on housing sale prices for the last few years in King County (near Seattle) between May 2014 and May 2015.


We want you to build an explanatory model for the price of housing in King County, i.e. an interpretable model in which the included variables are statistically justifiable.

The variable definitions are:

id - Unique ID for each home sold
date - Date of the home sale
price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
sqft_living - Square footage of the apartments interior living space
sqft_lot - Square footage of the land space
floors - Number of floors
waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
view - An index from 0 to 4 of how good the view of the property was
condition - An index from 1 to 5 on the condition of the apartment
grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design
sqft_above - The square footage of the interior housing space that is above ground level
sqft_basement - The square footage of the interior housing space that is below ground level
yr_built - The year the house was initially built
yr_renovated - The year of the house’s last renovation
zipcode - What zipcode area the house is in
lat - Lattitude
long - Longitude
sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors


2 Question 1

Tidy up the data ready for regression:

* You might like to think about removing some or all of `date`, `id`, `sqft_living15`, `sqft_lot15` and `zipcode` (`lat` and `long` provide a better measure of location in any event).
* Have a think about how to treat `waterfront`. Should we convert its type?
* We converted `yr_renovated` into a `renovated` logical variable, indicating whether the property had ever been renovated. You may wish to do the same.
* Have a think about how to treat `condition` and `grade`? Are they interval or categorical ordinal data types?
library(tidyverse)
## Warning: replacing previous import 'vctrs::data_frame' by
## 'tibble::data_frame' when loading 'dplyr'
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
library(GGally)
## Warning: package 'GGally' was built under R version 3.6.2
library(modelr)
houses <- read_csv("data/kc_house_data.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   id = col_character(),
##   date = col_datetime(format = "")
## )
## See spec(...) for full column specifications.
glimpse(houses)
## Rows: 21,613
## Columns: 21
## $ id            <chr> "7129300520", "6414100192", "5631500400", "2487200…
## $ date          <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 2…
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.…
## $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 178…
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, …
## $ waterfront    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ condition     <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3,…
## $ grade         <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7…
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 105…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, …
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 19…
## $ yr_renovated  <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode       <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 9…
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.65…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, …
## $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 17…
## $ sqft_lot15    <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, …
# tidy up data. In particular treat condition and grade as factor, as they are
# ordinal categorical
houses_tidy <- houses %>%
  select(-c("id", "date", "sqft_living15", "sqft_lot15", "zipcode")) %>%
  mutate(waterfront = as.logical(waterfront)) %>%
  mutate(renovated = yr_renovated != 0) %>%
  select(-"yr_renovated") %>%
  mutate(condition = as_factor(condition)) %>%
  mutate(grade = as_factor(grade))

glimpse(houses_tidy)
## Rows: 21,613
## Columns: 16
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 2…
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.…
## $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 178…
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, …
## $ waterfront    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ condition     <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3,…
## $ grade         <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7…
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 105…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, …
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 19…
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.65…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, …
## $ renovated     <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…


3 Question 2

Check for aliased variables using the alias() function (this takes in a formula object and a data set). [Hint - formula price ~ . says ‘price varying with all predictors’, this is a suitable input to alias()]. Remove variables that lead to an alias. Check the ‘Elements of multiple regression’ lesson for a dropdown containing further information on finding aliased variables in a dataset.

Solution

# Alias is useful to check if we have aliased variables, i.e. one or more
# variables that can be computed from other variables
alias(price ~ ., data = houses_tidy)
## Model :
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + view + condition + grade + sqft_above + sqft_basement + 
##     yr_built + lat + long + renovated
## 
## Complete :
##               (Intercept) bedrooms bathrooms sqft_living sqft_lot floors
## sqft_basement  0           0        0         1           0        0    
##               waterfrontTRUE view condition2 condition3 condition4
## sqft_basement  0              0    0          0          0        
##               condition5 grade3 grade4 grade5 grade6 grade7 grade8 grade9
## sqft_basement  0          0      0      0      0      0      0      0    
##               grade10 grade11 grade12 grade13 sqft_above yr_built lat long
## sqft_basement  0       0       0       0      -1          0        0   0  
##               renovatedTRUE
## sqft_basement  0
# seems that sqft_basement can be computed from sqft_living - sqft_above.
# let's drop sqft_living leaving just the two contributions sqft_basement and 
# sqft_above
houses_tidy <- houses_tidy %>%
  select(-"sqft_living")

glimpse(houses_tidy)
## Rows: 21,613
## Columns: 15
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 2…
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.…
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, …
## $ waterfront    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ condition     <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3,…
## $ grade         <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7…
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 105…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, …
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 19…
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.65…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, …
## $ renovated     <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…


4 Question 3

Systematically build a regression model containing up to four main effects (remember, a main effect is just a single predictor with coefficient), testing the regression diagnostics as you go

* splitting datasets into numeric and non-numeric columns might help `ggpairs()` run in manageable time, although you will need to add either a `price` or `resid` column to the non-numeric dataframe in order to see its correlations with the non-numeric predictors. You can do it in the following way:
houses_tidy_numeric <- houses_tidy %>%
  select_if(is.numeric)

houses_tidy_nonnumeric <- houses_tidy %>%
  select_if(function(x) !is.numeric(x))

houses_tidy_nonnumeric$price <- houses_tidy$price

Solution

ggpairs(houses_tidy_numeric)

ggpairs(houses_tidy_nonnumeric)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can also subset your ggpairs plot doing the following:


ggpairs(houses_tidy_numeric, columns = c(1, 2, 3, 4))

and the same in subsequent rounds of predictor selection with the resid column.

Remember, if you are not sure whether including a categorical predictor is statistically justified, run an anova() test passing in the models with- and without the categorical predictor and check the p-value of the test.

houses_tidy_numeric <- houses_tidy %>%
  select_if(is.numeric)

houses_tidy_nonnumeric <- houses_tidy %>%
  select_if(function(x) !is.numeric(x))

houses_tidy_nonnumeric$price <- houses_tidy$price

ggpairs(houses_tidy_numeric)

ggpairs(houses_tidy_nonnumeric)

Correlation of sqft_above with price looks pretty promising, but split of price by grade and waterfront also look decent.

houses_tidy %>%
  ggplot(aes(x = grade, y = price)) +
  geom_boxplot()

houses_tidy %>%
  ggplot(aes(x = waterfront, y = price)) +
  geom_boxplot()

Start building the models:

# model with price and sqft above
mod1a <- lm(price ~ sqft_above, data = houses_tidy)
summary(mod1a)
## 
## Call:
## lm(formula = price ~ sqft_above, data = houses_tidy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -913132 -165624  -41468  109327 5339232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59953.2     4729.8   12.68   <2e-16 ***
## sqft_above     268.5        2.4  111.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 292200 on 21611 degrees of freedom
## Multiple R-squared:  0.3667, Adjusted R-squared:  0.3667 
## F-statistic: 1.251e+04 on 1 and 21611 DF,  p-value: < 2.2e-16
mod1b <- lm(price ~ grade, data = houses_tidy)
summary(mod1b)
## 
## Call:
## lm(formula = price ~ grade, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1929615  -135853   -35090    89080  5565658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   142000     254499   0.558 0.576878    
## grade3         63667     293870   0.217 0.828484    
## grade4         72381     258849   0.280 0.779767    
## grade5        106524     255024   0.418 0.676169    
## grade6        159920     254561   0.628 0.529868    
## grade7        260590     254513   1.024 0.305904    
## grade8        400853     254520   1.575 0.115285    
## grade9        631513     254547   2.481 0.013112 *  
## grade10       929771     254611   3.652 0.000261 ***
## grade11      1354842     254817   5.317 1.07e-07 ***
## grade12      2049222     255909   8.008 1.23e-15 ***
## grade13      3567615     264106  13.508  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254500 on 21601 degrees of freedom
## Multiple R-squared:  0.5197, Adjusted R-squared:  0.5195 
## F-statistic:  2125 on 11 and 21601 DF,  p-value: < 2.2e-16
mod1c <- lm(price ~ waterfront, data = houses_tidy)
summary(mod1c)
## 
## Call:
## lm(formula = price ~ waterfront, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1376876  -211564   -81564   108436  7168436 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      531564       2416  220.00   <2e-16 ***
## waterfrontTRUE  1130312      27822   40.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 353900 on 21611 degrees of freedom
## Multiple R-squared:  0.07095,    Adjusted R-squared:  0.07091 
## F-statistic:  1650 on 1 and 21611 DF,  p-value: < 2.2e-16

Grade looks the most promising, but some of the grade level coefficients are insignificant.

From the F-test at the bottom of the regression output tests against the null model (i.e. intercept only). But, if we want, we can replicate this using a separate anova. For this, our null model would be to regress price on intercept only.

You can do this in the following way:

null_model <- lm(price ~ 1, data = houses_tidy)
grade_model <- lm(price ~ grade, data = houses_tidy)
anova(null_model, grade_model)
# grade is significant, let's keep it. Now plot diagnostics
par(mfrow = c(2, 2))
plot(mod1b)
## Warning: not plotting observations with leverage one:
##   19453

## Warning: not plotting observations with leverage one:
##   19453

houses_resid <- houses_tidy %>%
  add_residuals(mod1b) %>%
  select(-c("price", "grade"))

houses_resid_numeric <- houses_resid %>%
  select_if(is.numeric)

houses_resid_nonnumeric <- houses_resid %>%
  select_if(function(x) !is.numeric(x))

houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)

ggpairs(houses_resid_nonnumeric)

Our predictor lat has highest correlation with residuals, but, again, waterfront still looks pretty promising. Try both…

mod2a <- lm(price ~ grade + lat, data = houses_tidy)
summary(mod2a)
## 
## Call:
## lm(formula = price ~ grade + lat, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1863783  -112379   -28181    67454  5533910 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29949769     610666 -49.044  < 2e-16 ***
## grade3         187923     276126   0.681 0.496151    
## grade4          95367     243212   0.392 0.694978    
## grade5         126354     239618   0.527 0.597980    
## grade6         158453     239183   0.662 0.507673    
## grade7         246275     239138   1.030 0.303093    
## grade8         378635     239144   1.583 0.113369    
## grade9         603010     239170   2.521 0.011701 *  
## grade10        891752     239230   3.728 0.000194 ***
## grade11       1311124     239425   5.476  4.4e-08 ***
## grade12       2007093     240450   8.347  < 2e-16 ***
## grade13       3502099     248154  14.113  < 2e-16 ***
## lat            633100      11822  53.554  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239100 on 21600 degrees of freedom
## Multiple R-squared:  0.576,  Adjusted R-squared:  0.5758 
## F-statistic:  2445 on 12 and 21600 DF,  p-value: < 2.2e-16
mod2b <- lm(price ~ grade + waterfront, data = houses_tidy)
summary(mod2b)
## 
## Call:
## lm(formula = price ~ grade + waterfront, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1929615  -132969   -32393    91481  4778940 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      142000     244366   0.581   0.5612    
## grade3            63667     282169   0.226   0.8215    
## grade4            72381     248543   0.291   0.7709    
## grade5            92834     244870   0.379   0.7046    
## grade6           155043     244426   0.634   0.5259    
## grade7           258469     244379   1.058   0.2902    
## grade8           395393     244386   1.618   0.1057    
## grade9           623595     244413   2.551   0.0107 *  
## grade10          909321     244474   3.719   0.0002 ***
## grade11         1313326     244674   5.368 8.06e-08 ***
## grade12         1947993     245731   7.927 2.35e-15 ***
## grade13         3567615     253590  14.068  < 2e-16 ***
## waterfrontTRUE   828234      19363  42.773  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 244400 on 21600 degrees of freedom
## Multiple R-squared:  0.5572, Adjusted R-squared:  0.557 
## F-statistic:  2265 on 12 and 21600 DF,  p-value: < 2.2e-16
# lat is significant and higher r^2, let's keep model2a
par(mfrow = c(2, 2))
plot(mod2a)
## Warning: not plotting observations with leverage one:
##   19453

## Warning: not plotting observations with leverage one:
##   19453

houses_resid <- houses_tidy %>%
  add_residuals(mod2a) %>%
  select(-c("price", "grade", "lat"))

houses_resid_numeric <- houses_resid %>%
  select_if(is.numeric)

houses_resid_nonnumeric <- houses_resid %>%
  select_if(function(x) !is.numeric(x))

houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)

ggpairs(houses_resid_nonnumeric)

Now view has strongest correlation with residuals, but also compare against model with waterfront.

mod3a <- lm(price ~ grade + lat + view, data = houses_tidy)
summary(mod3a)
## 
## Call:
## lm(formula = price ~ grade + lat + view, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1665265  -105866   -21393    69623  5429667 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -30514024     576750 -52.907  < 2e-16 ***
## grade3         190253     260743   0.730 0.465607    
## grade4          81058     229663   0.353 0.724133    
## grade5         112154     226269   0.496 0.620134    
## grade6         148515     225858   0.658 0.510827    
## grade7         235345     225815   1.042 0.297328    
## grade8         351873     225822   1.558 0.119203    
## grade9         556584     225848   2.464 0.013731 *  
## grade10        821118     225907   3.635 0.000279 ***
## grade11       1200228     226097   5.308 1.12e-07 ***
## grade12       1832950     227080   8.072 7.28e-16 ***
## grade13       3303587     234361  14.096  < 2e-16 ***
## lat            644972      11166  57.764  < 2e-16 ***
## view           106862       2086  51.234  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 225800 on 21599 degrees of freedom
## Multiple R-squared:  0.6219, Adjusted R-squared:  0.6217 
## F-statistic:  2733 on 13 and 21599 DF,  p-value: < 2.2e-16
mod3b <- lm(price ~ grade + lat + waterfront, data = houses_tidy)
summary(mod3b)
## 
## Call:
## lm(formula = price ~ grade + lat + waterfront, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1862555  -109244   -24781    70145  4724767 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -30511083     581588 -52.462  < 2e-16 ***
## grade3            190241     262923   0.724 0.469343    
## grade4             95796     231583   0.414 0.679130    
## grade5            112654     228161   0.494 0.621488    
## grade6            153414     227746   0.674 0.500562    
## grade7            243828     227703   1.071 0.284264    
## grade8            372609     227709   1.636 0.101783    
## grade9            594340     227734   2.610 0.009066 ** 
## grade10           870026     227792   3.819 0.000134 ***
## grade11          1267641     227979   5.560 2.72e-08 ***
## grade12          1902270     228964   8.308  < 2e-16 ***
## grade13          3500877     236288  14.816  < 2e-16 ***
## lat               644910      11259  57.278  < 2e-16 ***
## waterfrontTRUE    851219      18046  47.168  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 227700 on 21599 degrees of freedom
## Multiple R-squared:  0.6156, Adjusted R-squared:  0.6154 
## F-statistic:  2661 on 13 and 21599 DF,  p-value: < 2.2e-16
# view model is best, keep mod3a
par(mfrow = c(2, 2))
plot(mod3a)
## Warning: not plotting observations with leverage one:
##   19453

## Warning: not plotting observations with leverage one:
##   19453

houses_resid <- houses_tidy %>%
  add_residuals(mod3a) %>%
  select(-c("price", "grade", "lat", "view"))

houses_resid_numeric <- houses_resid %>%
  select_if(is.numeric)

houses_resid_nonnumeric <- houses_resid %>%
  select_if(function(x) !is.numeric(x))

houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)

ggpairs(houses_resid_nonnumeric)

sqft_basement has highest correlation with residuals. Let’s test against all remaining categorical predictors:

mod4a <- lm(price ~ grade + lat + view + sqft_basement, data = houses_tidy)
summary(mod4a)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1569582  -105308   -18663    70867  5230039 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.875e+07  5.647e+05 -50.913  < 2e-16 ***
## grade3         1.830e+05  2.542e+05   0.720  0.47164    
## grade4         8.134e+04  2.239e+05   0.363  0.71638    
## grade5         1.088e+05  2.206e+05   0.493  0.62197    
## grade6         1.355e+05  2.202e+05   0.615  0.53841    
## grade7         2.042e+05  2.201e+05   0.927  0.35373    
## grade8         3.194e+05  2.201e+05   1.451  0.14687    
## grade9         5.281e+05  2.202e+05   2.399  0.01646 *  
## grade10        7.859e+05  2.202e+05   3.568  0.00036 ***
## grade11        1.156e+06  2.204e+05   5.244 1.59e-07 ***
## grade12        1.765e+06  2.214e+05   7.974 1.61e-15 ***
## grade13        3.169e+06  2.285e+05  13.869  < 2e-16 ***
## lat            6.079e+05  1.094e+04  55.559  < 2e-16 ***
## view           8.903e+04  2.101e+03  42.367  < 2e-16 ***
## sqft_basement  1.204e+02  3.582e+00  33.600  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 220100 on 21598 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6405 
## F-statistic:  2751 on 14 and 21598 DF,  p-value: < 2.2e-16
mod4b <- lm(price ~ grade + lat + view + waterfront, data = houses_tidy)
summary(mod4b)
## 
## Call:
## lm(formula = price ~ grade + lat + view + waterfront, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1713543  -105538   -20871    69842  4901827 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -30758475     564698 -54.469  < 2e-16 ***
## grade3            191262     255269   0.749 0.453709    
## grade4             84892     224841   0.378 0.705758    
## grade5            106294     221518   0.480 0.631345    
## grade6            147526     221116   0.667 0.504659    
## grade7            236375     221074   1.069 0.284985    
## grade8            354372     221081   1.603 0.108970    
## grade9            562139     221106   2.542 0.011016 *  
## grade10           823728     221164   3.725 0.000196 ***
## grade11          1197915     221350   5.412 6.30e-08 ***
## grade12          1804315     222314   8.116 5.07e-16 ***
## grade13          3351868     229446  14.609  < 2e-16 ***
## lat               650115      10932  59.466  < 2e-16 ***
## view               80422       2217  36.273  < 2e-16 ***
## waterfrontTRUE    582422      19024  30.615  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221100 on 21598 degrees of freedom
## Multiple R-squared:  0.6377, Adjusted R-squared:  0.6374 
## F-statistic:  2715 on 14 and 21598 DF,  p-value: < 2.2e-16
mod4c <- lm(price ~ grade + lat + view + condition, data = houses_tidy)
summary(mod4c)
## 
## Call:
## lm(formula = price ~ grade + lat + view + condition, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1665132  -105472   -18392    71578  5447648 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.043e+07  5.673e+05 -53.640  < 2e-16 ***
## grade3       1.472e+05  2.595e+05   0.567 0.570486    
## grade4       6.481e+04  2.291e+05   0.283 0.777293    
## grade5       7.656e+04  2.258e+05   0.339 0.734605    
## grade6       1.165e+05  2.257e+05   0.516 0.605628    
## grade7       2.090e+05  2.257e+05   0.926 0.354406    
## grade8       3.359e+05  2.257e+05   1.488 0.136720    
## grade9       5.476e+05  2.257e+05   2.426 0.015270 *  
## grade10      8.154e+05  2.258e+05   3.612 0.000305 ***
## grade11      1.199e+06  2.260e+05   5.307 1.12e-07 ***
## grade12      1.834e+06  2.269e+05   8.082 6.69e-16 ***
## grade13      3.310e+06  2.339e+05  14.148  < 2e-16 ***
## lat          6.432e+05  1.098e+04  58.552  < 2e-16 ***
## view         1.017e+05  2.058e+03  49.425  < 2e-16 ***
## condition2  -4.275e+02  4.466e+04  -0.010 0.992363    
## condition3  -6.453e+03  4.154e+04  -0.155 0.876556    
## condition4   5.758e+04  4.158e+04   1.385 0.166054    
## condition5   1.349e+05  4.180e+04   3.226 0.001258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221800 on 21595 degrees of freedom
## Multiple R-squared:  0.6352, Adjusted R-squared:  0.635 
## F-statistic:  2212 on 17 and 21595 DF,  p-value: < 2.2e-16
mod4d <- lm(price ~ grade + lat + view + renovated, data = houses_tidy)
summary(mod4d)
## 
## Call:
## lm(formula = price ~ grade + lat + view + renovated, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1637628  -105101   -19410    70577  5280904 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -30179084     571198 -52.835  < 2e-16 ***
## grade3           188870     258132   0.732 0.464372    
## grade4            75932     227363   0.334 0.738407    
## grade5           108602     224003   0.485 0.627808    
## grade6           141362     223596   0.632 0.527250    
## grade7           229992     223554   1.029 0.303586    
## grade8           346281     223561   1.549 0.121413    
## grade9           550447     223587   2.462 0.013828 *  
## grade10          817501     223645   3.655 0.000257 ***
## grade11         1199432     223833   5.359 8.47e-08 ***
## grade12         1835525     224806   8.165 3.39e-16 ***
## grade13         3275946     232018  14.119  < 2e-16 ***
## lat              637925      11059  57.684  < 2e-16 ***
## view             102285       2076  49.261  < 2e-16 ***
## renovatedTRUE    159554       7606  20.979  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 223500 on 21598 degrees of freedom
## Multiple R-squared:  0.6295, Adjusted R-squared:  0.6292 
## F-statistic:  2621 on 14 and 21598 DF,  p-value: < 2.2e-16
# looks like model with sqft_basement is best, keep mod4a
par(mfrow = c(2, 2))
plot(mod4a)
## Warning: not plotting observations with leverage one:
##   19453

## Warning: not plotting observations with leverage one:
##   19453

houses_resid <- houses_tidy %>%
  add_residuals(mod4a) %>%
  select(- price)

Our final model in terms of main effects is: price ~ grade + lat + view + sqft_basement


5 Extensions

  • Consider possible interactions between your four main effect predictors and test their effect upon \(r^2\). Choose your best candidate interaction and visualise its effect.

  • Calculate the relative importance of predictors from your best \(4\)-predictor model (i.e. the model without an interaction). Which predictor affects price most strongly?

Solution

Now, for interactions, have six possibilities that obey principle of strong hierarchy (i.e. consider including an interaction only if its main effects are already present in the model)

mod5a <- lm(price ~ grade + lat + view + sqft_basement + grade:lat, data = houses_tidy)
summary(mod5a)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:lat, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1496592  -101302   -20837    67958  5196329 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.375e+07  4.424e+07  -0.989   0.3228    
## grade3         3.068e+07  6.120e+07   0.501   0.6162    
## grade4         3.760e+07  4.798e+07   0.784   0.4332    
## grade5         2.917e+07  4.457e+07   0.654   0.5129    
## grade6         2.420e+07  4.428e+07   0.546   0.5848    
## grade7         1.912e+07  4.425e+07   0.432   0.6656    
## grade8         1.546e+07  4.425e+07   0.349   0.7267    
## grade9        -7.530e+05  4.427e+07  -0.017   0.9864    
## grade10       -1.492e+07  4.435e+07  -0.336   0.7366    
## grade11       -1.469e+07  4.466e+07  -0.329   0.7422    
## grade12       -8.399e+07  4.632e+07  -1.813   0.0698 .  
## grade13        3.135e+06  2.463e+05  12.728   <2e-16 ***
## lat            9.234e+05  9.308e+05   0.992   0.3212    
## view           8.949e+04  2.086e+03  42.892   <2e-16 ***
## sqft_basement  1.208e+02  3.556e+00  33.968   <2e-16 ***
## grade3:lat    -6.429e+05  1.290e+06  -0.498   0.6183    
## grade4:lat    -7.898e+05  1.010e+06  -0.782   0.4340    
## grade5:lat    -6.115e+05  9.378e+05  -0.652   0.5143    
## grade6:lat    -5.062e+05  9.315e+05  -0.543   0.5869    
## grade7:lat    -3.980e+05  9.310e+05  -0.428   0.6690    
## grade8:lat    -3.186e+05  9.310e+05  -0.342   0.7322    
## grade9:lat     2.662e+04  9.315e+05   0.029   0.9772    
## grade10:lat    3.296e+05  9.330e+05   0.353   0.7239    
## grade11:lat    3.325e+05  9.395e+05   0.354   0.7234    
## grade12:lat    1.801e+06  9.745e+05   1.848   0.0646 .  
## grade13:lat           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218400 on 21588 degrees of freedom
## Multiple R-squared:  0.6464, Adjusted R-squared:  0.646 
## F-statistic:  1644 on 24 and 21588 DF,  p-value: < 2.2e-16
mod5b <- lm(price ~ grade + lat + view + sqft_basement + grade:view, data = houses_tidy)
summary(mod5b)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:view, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1878838  -104533   -20264    69483  5165978 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.871e+07  5.606e+05 -51.208  < 2e-16 ***
## grade3         1.828e+05  2.523e+05   0.725 0.468734    
## grade4         8.617e+04  2.226e+05   0.387 0.698645    
## grade5         1.122e+05  2.189e+05   0.513 0.608291    
## grade6         1.379e+05  2.185e+05   0.631 0.528067    
## grade7         2.081e+05  2.185e+05   0.952 0.340866    
## grade8         3.222e+05  2.185e+05   1.475 0.140313    
## grade9         5.373e+05  2.185e+05   2.459 0.013953 *  
## grade10        7.518e+05  2.186e+05   3.439 0.000585 ***
## grade11        1.084e+06  2.189e+05   4.953 7.35e-07 ***
## grade12        1.741e+06  2.208e+05   7.887 3.23e-15 ***
## grade13        2.594e+06  2.372e+05  10.939  < 2e-16 ***
## lat            6.070e+05  1.086e+04  55.879  < 2e-16 ***
## view           4.024e+05  3.764e+04  10.689  < 2e-16 ***
## sqft_basement  1.178e+02  3.570e+00  32.987  < 2e-16 ***
## grade3:view           NA         NA      NA       NA    
## grade4:view   -3.485e+05  1.009e+05  -3.454 0.000553 ***
## grade5:view   -3.381e+05  4.384e+04  -7.712 1.29e-14 ***
## grade6:view   -3.360e+05  3.893e+04  -8.631  < 2e-16 ***
## grade7:view   -3.454e+05  3.794e+04  -9.104  < 2e-16 ***
## grade8:view   -3.214e+05  3.782e+04  -8.498  < 2e-16 ***
## grade9:view   -3.326e+05  3.788e+04  -8.782  < 2e-16 ***
## grade10:view  -2.595e+05  3.800e+04  -6.828 8.81e-12 ***
## grade11:view  -2.424e+05  3.843e+04  -6.307 2.89e-10 ***
## grade12:view  -2.972e+05  4.002e+04  -7.428 1.14e-13 ***
## grade13:view          NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218500 on 21589 degrees of freedom
## Multiple R-squared:  0.6462, Adjusted R-squared:  0.6459 
## F-statistic:  1715 on 23 and 21589 DF,  p-value: < 2.2e-16
mod5c <- lm(price ~ grade + lat + view + sqft_basement + grade:sqft_basement, data = houses_tidy)
summary(mod5c)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:sqft_basement, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2093544  -102714   -19903    68417  4922552 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.922e+07  5.576e+05 -52.398  < 2e-16 ***
## grade3                 1.849e+05  2.507e+05   0.737 0.460910    
## grade4                 7.737e+04  2.210e+05   0.350 0.726274    
## grade5                 1.098e+05  2.176e+05   0.504 0.613935    
## grade6                 1.427e+05  2.172e+05   0.657 0.511185    
## grade7                 2.183e+05  2.172e+05   1.005 0.314862    
## grade8                 3.305e+05  2.172e+05   1.522 0.128016    
## grade9                 5.195e+05  2.172e+05   2.392 0.016782 *  
## grade10                7.595e+05  2.173e+05   3.495 0.000475 ***
## grade11                1.067e+06  2.176e+05   4.906 9.38e-07 ***
## grade12                1.603e+06  2.191e+05   7.318 2.61e-13 ***
## grade13                2.047e+06  2.372e+05   8.629  < 2e-16 ***
## lat                    6.177e+05  1.080e+04  57.166  < 2e-16 ***
## view                   8.401e+04  2.100e+03  40.011  < 2e-16 ***
## sqft_basement          9.153e+02  5.204e+01  17.589  < 2e-16 ***
## grade3:sqft_basement          NA         NA      NA       NA    
## grade4:sqft_basement  -6.667e+01  1.106e+03  -0.060 0.951943    
## grade5:sqft_basement  -7.957e+02  1.065e+02  -7.470 8.33e-14 ***
## grade6:sqft_basement  -8.503e+02  5.511e+01 -15.429  < 2e-16 ***
## grade7:sqft_basement  -8.442e+02  5.238e+01 -16.118  < 2e-16 ***
## grade8:sqft_basement  -8.273e+02  5.241e+01 -15.785  < 2e-16 ***
## grade9:sqft_basement  -7.618e+02  5.272e+01 -14.451  < 2e-16 ***
## grade10:sqft_basement -7.237e+02  5.317e+01 -13.611  < 2e-16 ***
## grade11:sqft_basement -6.234e+02  5.421e+01 -11.499  < 2e-16 ***
## grade12:sqft_basement -5.887e+02  5.646e+01 -10.427  < 2e-16 ***
## grade13:sqft_basement         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 217100 on 21589 degrees of freedom
## Multiple R-squared:  0.6505, Adjusted R-squared:  0.6502 
## F-statistic:  1747 on 23 and 21589 DF,  p-value: < 2.2e-16
mod5d <- lm(price ~ grade + lat + view + sqft_basement + lat:view, data = houses_tidy)
summary(mod5d)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + lat:view, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1544431  -103361   -20451    68950  5201833 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.680e+07  5.780e+05 -46.362  < 2e-16 ***
## grade3         1.749e+05  2.530e+05   0.691 0.489328    
## grade4         7.915e+04  2.228e+05   0.355 0.722429    
## grade5         1.100e+05  2.195e+05   0.501 0.616167    
## grade6         1.372e+05  2.191e+05   0.626 0.531099    
## grade7         2.062e+05  2.191e+05   0.941 0.346529    
## grade8         3.209e+05  2.191e+05   1.464 0.143085    
## grade9         5.296e+05  2.191e+05   2.417 0.015660 *  
## grade10        7.858e+05  2.192e+05   3.585 0.000338 ***
## grade11        1.152e+06  2.194e+05   5.251 1.53e-07 ***
## grade12        1.757e+06  2.203e+05   7.974 1.61e-15 ***
## grade13        3.133e+06  2.274e+05  13.776  < 2e-16 ***
## lat            5.668e+05  1.125e+04  50.366  < 2e-16 ***
## view          -1.069e+07  7.448e+05 -14.352  < 2e-16 ***
## sqft_basement  1.188e+02  3.567e+00  33.320  < 2e-16 ***
## lat:view       2.266e+05  1.566e+04  14.471  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219100 on 21597 degrees of freedom
## Multiple R-squared:  0.6442, Adjusted R-squared:  0.6439 
## F-statistic:  2607 on 15 and 21597 DF,  p-value: < 2.2e-16
mod5e <- lm(price ~ grade + lat + view + sqft_basement + lat:sqft_basement, data = houses_tidy)
summary(mod5e)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + lat:sqft_basement, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1550147  -101737   -18763    68303  5174888 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.488e+07  6.359e+05 -39.124  < 2e-16 ***
## grade3             1.670e+05  2.532e+05   0.659 0.509600    
## grade4             7.862e+04  2.230e+05   0.353 0.724442    
## grade5             1.072e+05  2.197e+05   0.488 0.625492    
## grade6             1.362e+05  2.193e+05   0.621 0.534705    
## grade7             2.060e+05  2.193e+05   0.939 0.347638    
## grade8             3.210e+05  2.193e+05   1.464 0.143228    
## grade9             5.299e+05  2.193e+05   2.416 0.015700 *  
## grade10            7.885e+05  2.194e+05   3.594 0.000326 ***
## grade11            1.158e+06  2.196e+05   5.274 1.35e-07 ***
## grade12            1.765e+06  2.205e+05   8.004 1.26e-15 ***
## grade13            3.143e+06  2.276e+05  13.809  < 2e-16 ***
## lat                5.264e+05  1.256e+04  41.918  < 2e-16 ***
## view               8.911e+04  2.093e+03  42.569  < 2e-16 ***
## sqft_basement     -1.698e+04  1.310e+03 -12.968  < 2e-16 ***
## lat:sqft_basement  3.595e+02  2.752e+01  13.060  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219300 on 21597 degrees of freedom
## Multiple R-squared:  0.6435, Adjusted R-squared:  0.6433 
## F-statistic:  2599 on 15 and 21597 DF,  p-value: < 2.2e-16
mod5f <- lm(price ~ grade + lat + view + sqft_basement + view:sqft_basement, data = houses_tidy)
summary(mod5f)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + view:sqft_basement, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1638784  -104221   -19570    69709  5166209 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.883e+07  5.636e+05 -51.155  < 2e-16 ***
## grade3              1.833e+05  2.537e+05   0.723 0.469931    
## grade4              8.399e+04  2.234e+05   0.376 0.707022    
## grade5              1.118e+05  2.201e+05   0.508 0.611432    
## grade6              1.385e+05  2.197e+05   0.630 0.528579    
## grade7              2.087e+05  2.197e+05   0.950 0.342139    
## grade8              3.238e+05  2.197e+05   1.474 0.140599    
## grade9              5.312e+05  2.197e+05   2.417 0.015645 *  
## grade10             7.884e+05  2.198e+05   3.587 0.000335 ***
## grade11             1.155e+06  2.200e+05   5.250 1.54e-07 ***
## grade12             1.742e+06  2.210e+05   7.886 3.27e-15 ***
## grade13             3.138e+06  2.281e+05  13.756  < 2e-16 ***
## lat                 6.096e+05  1.092e+04  55.822  < 2e-16 ***
## view                7.106e+04  2.847e+03  24.962  < 2e-16 ***
## sqft_basement       1.056e+02  3.910e+00  27.003  < 2e-16 ***
## view:sqft_basement  2.870e+01  3.073e+00   9.339  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219700 on 21597 degrees of freedom
## Multiple R-squared:  0.6422, Adjusted R-squared:  0.6419 
## F-statistic:  2584 on 15 and 21597 DF,  p-value: < 2.2e-16
# mod5c looks like the best
par(mfrow = c(2,2))
plot(mod5c)
## Warning: not plotting observations with leverage one:
##   8620, 19453

## Warning: not plotting observations with leverage one:
##   8620, 19453

It seems that the grade:sqft_basement interaction leads to highest \(r^2\) (but two levels of the interaction cannot be determined due to fitting problems).

Now let’s see a visualisation of the effect of this interaction.

houses_resid %>%
  ggplot(aes(x = sqft_basement, y = resid, colour = grade)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ grade)
## `geom_smooth()` using formula 'y ~ x'

Relative importance of predictors:

library(relaimpo)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: boot
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
## 
##     aml
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
## Loading required package: mitools
## This is the global version of package relaimpo.
## If you are a non-US user, a version with the interesting additional metric pmvd is available
## from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
calc.relimp(mod4a, method = "lmg", rela = TRUE)
## Response variable: price 
## Total response variance: 134782378397 
## Analysis based on 21613 observations 
## 
## 14 Regressors: 
## Some regressors combined in groups: 
##         Group  grade : grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12 grade13 
## 
##  Relative importance of 4 (groups of) regressors assessed: 
##  grade lat view sqft_basement 
##  
## Proportion of variance explained by model: 64.07%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                      lmg
## grade         0.67013680
## lat           0.10999806
## view          0.13594911
## sqft_basement 0.08391603
## 
## Average coefficients for different model sizes: 
## 
##                     1group      2groups      3groups      4groups
## grade3          63666.6672  105085.5034  144251.4794  182970.8245
## grade4          72381.0351   74840.3850   77838.0192   81340.5229
## grade5         106523.9716  106120.0182  106860.8796  108762.8919
## grade6         159919.6380  148997.5902  140821.1511  135460.9340
## grade7         260590.2629  235879.0272  216962.5200  204157.6486
## grade8         400852.7662  366263.3497  339139.6632  319376.1318
## grade9         631513.1864  588676.4480  554462.1758  528149.5236
## grade10        929771.0746  870400.5336  822843.4234  785879.9257
## grade11       1354841.7274 1272639.8932 1206980.5470 1155890.1732
## grade12       2049222.0006 1930531.4478 1836851.8205 1765372.6480
## grade13       3567615.3852 3398156.8184 3266023.2283 3169155.3827
## lat            813411.5832  722508.3996  660525.2637  607867.6492
## view           190335.2479  151137.4701  117930.1234   89031.6774
## sqft_basement     268.6136     203.8056     154.5114     120.3599

It looks like the grade of property is the most important determiner of price, followed by the number of views the property has received.